Abstract. Differentially private mechanisms enjoy a variety of composition
properties. Leveraging these, McSherry introduced PINQ (SIGMOD 2009), a
system empowering non-experts to construct new differentially private
analyses. PINQ is a LINQ-like API which provides automatic privacy guarantees
for all programs which use it to mediate sensitive data manipulation. In this
work we introduce Featherweight PINQ, a formal model capturing the essence of
PINQ. We prove that any program interacting with Featherweight PINQ’s API is
differentially private.

1 Introduction

Differential privacy shows that by adding the right amount of noise to
statistical queries, one can get useful results and at the same time provide a
quantifiable notion of privacy. Differential privacy is defined for a query
mechanism (a randomized algorithm) by comparing the results of a query on any
database with or without any one individual: a query $Q$ is
$\epsilon$-differentially private if the probability of any query outcome on a
data set changes by at most a factor of $e^{\epsilon}$ (approximately
$1 + \epsilon$ for small $\epsilon$) whenever an individual is added or
removed.

Of the many papers on differential privacy, a mere handful (at the time of
writing) describe implemented systems which provide more than just a static
collection of differentially private operations. The first such system is the
PINQ system of McSherry [9]. PINQ is designed to allow non-experts in
differential privacy to build privacy-preserving data analyses. The system
works by leveraging a fixed collection of differentially private data
aggregation functions (counts, averages, etc.) and a collection of data
manipulation operations, all embedded with a LINQ-like interface in otherwise
arbitrary C# code. PINQ mediates all accesses to sensitive data in order to
keep track of the sensitivity of the various computed objects, and to ensure
that the intended privacy budget $\epsilon$ is not exceeded; a budget could be
exceeded by answering too many queries with too high accuracy. In this way
PINQ is intended to make sure that the analyst (programmer) does not
inadvertently break differential privacy.



Foundations of PINQ McSherry argues the correctness of PINQ by pointing
out the foundations upon which PINQ rests. In essence these are:

1. A predefined collection of aggregation operations (queries) on tables, each
with a parameter specifying the required degree of differential privacy.
Standard aggregation operations such as (noisy) count and average are
implemented. The core assumption is that each aggregation operation $Q$ with
noise parameter $\epsilon$, written here as $Q_\epsilon$, is an
$\epsilon$-differentially private randomised function.

2. Sequential composition principle: if two queries are performed in sequence
(e.g. with differential privacy $\epsilon_1$ and $\epsilon_2$ respectively)
then the overall level of differential privacy is safely estimated by summing
the privacy costs of the individual queries ($\epsilon_1 + \epsilon_2$).

3. Parallel composition principle: if the data is partitioned into disjoint
parts, and a different query is applied to each partition, then the overall
level of differential privacy is safely estimated by taking the maximum of the
costs of the individual queries.

4. Stability composition: the stability of a database transformation $T$ is
defined to be $c$ if, whenever you add $n$ extra elements to the argument of
$T$, the result of $T$ changes by no more than $n \times c$ elements. If you
first transform a database by $T$, then query the result with an
$\epsilon$-private query, the privacy afforded by the composition of the two
operations is safely approximated by $c \times \epsilon$.
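The three composition principles amount to simple arithmetic on privacy costs.
The following Python sketch is only an illustration of that arithmetic; the
function names are hypothetical and are not part of PINQ.

```python
def sequential_cost(costs):
    """Queries issued one after another on the same data: costs add up."""
    return sum(costs)


def parallel_cost(costs):
    """Queries on disjoint partitions of the data: pay only the maximum."""
    return max(costs, default=0.0)


def stability_cost(c, epsilon):
    """An epsilon-private query applied after a c-stable transformation."""
    return c * epsilon


budget = 1.0
# Two queries of cost budget/2, one on each of two disjoint partitions.
loop_cost = parallel_cost([budget / 2, budget / 2])
# A further query of cost budget/2 on the unpartitioned data.
total = sequential_cost([loop_cost, budget / 2])
print(loop_cost, total)
```

Under these rules the two partitioned queries together cost only as much as
one of them, which is why a further query can still be afforded afterwards.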

These “foundations” of PINQ provide an intuition about how and why PINQ
works. However, although a novel aim of PINQ was “providing formal end-to-end
differential privacy guarantees under arbitrary use”, the foundations are
inadequate to build an end-to-end correctness argument, since they fall short
of describing a number of PINQ features of potential relevance to the question
of its differential privacy:

— Sequential composition is described in an oversimplified way, assuming that
the queries are chosen independently from each other. In practice the second
query of a sequence is issued by client code which has received the result of
the first query. Thus the second query may depend on the outcome of the
first. To argue correctness this adaptiveness should be modelled explicitly.

— Parallel queries partition data, but the data which is partitioned might not
be the original input, but some intermediate table. The informal argument
for taking the maximum of the privacy costs of the query on each partition
relies on the respective queries applying to disjoint data points. But the data
might not be disjoint when seen from the perspective of the original data set
of individuals. Data derived from a participant might end up in more than
one partition, so a correctness argument must model this possibility to show
that it is safe.

— As for sequential composition, parallel queries are not parallel at all, but
can be adaptive: the result of a query on one partition might depend on the
result of a query on another. This means that the implementation must perform
considerably more complicated bookkeeping to track the “maximum” cost of the
queries.

— The foundations suggest how to compute the privacy cost of composed
operations from their privacy and stability properties. But in practice PINQ
does not measure the amount of privacy lost by a PINQ program; it enforces
a stated bound. Because of this, there are two kinds of results from a query:
the normal noisy answer, or an exception. An exception is thrown if answering
the query normally would break the global privacy budget. To prove
differential privacy it is not enough that the query is differentially private
in the normal case; it must also be shown to be private in the case when an
exception is thrown, since this information is communicated to the program.

In this paper we provide a foundation for PINQ by defining a minimalistic
semantics, Featherweight PINQ, intended to model its essence, while at the same
time abstracting away from less relevant implementation details. By idealising
the interface we make clear the intended implementation, but not the details
of its realisation in any particular language. Thus we model the client program
completely abstractly as a deterministic labelled transition system which
interacts with tables via the PINQ-like API but which is otherwise
unconstrained. For this model we instantiate the definition of differential
privacy, taking into account the interactive nature of the system, and prove
that Featherweight PINQ provides differential privacy for any program.

2 PINQ 

In this section we provide a brief description of the PINQ system from the
user perspective. PINQ is a .NET API which provides an interface similar to
Language Integrated Query (LINQ), a language extension to .NET.
Analyses that use PINQ are typically written in C#.

Listing 1.1 shows a code fragment for a sample analysis which produces the
average ages of adult males and adult females, respectively, and then
separately computes the average age of all individuals.

Listing 1.1. PINQ sample code
var agent = new PINQAgentBudget(budget);
var data = new PINQueryable<recordstype>(rawdata, agent);
var adults = data.Where(x => x.age > 17);
var genders = new[] { 0, 1 };
var parts = adults.Partition(genders, x => x.gender);
foreach (var a in genders) {
    Console.WriteLine("Average age of {0} is {1}",
        a == 0 ? "Males" : "Females",
        parts[a].NoisyAverage(budget / 2, x => x.age));
}
Console.WriteLine("Average age (all): " + data.NoisyAverage(c));

The first two lines of the program initialise a PINQueryable object with the
sample sensitive data (rawdata) and set the privacy limit (budget). A
PINQueryable object is a wrapper around the database which enables PINQ to
track the properties that are relevant for differential privacy. The supplied
“agent” parameter expresses the amount of differential privacy that the system
will enforce on this database.

The analysis starts by selecting (line 3) a subset of records of interest (those
who are adults). Behind the scenes PINQ records the fact that the stability of
the data is unchanged: adding a single record to rawdata does not change the
size of the result of this transformation by more than a single record.

In line 5 a partitioning operation splits the data into two groups based on
the gender field (0 for Male, 1 for Female). Partition is not a standard
LINQ/SQL style operation, but is specific to PINQ. For each partition (i.e. for
each gender), the code outputs a noisy average of the age. NoisyAverage is one
of a collection of built-in differentially private primitive aggregation
operations provided by PINQ. The amount of differential privacy for each query
in the loop is budget/2. Since the two queries apply to disjoint partitions,
the loop as a whole costs only budget/2, so after executing the foreach loop
there will be budget/2 of the original budget remaining. The outcome of the
last line depends on the accuracy/privacy parameter c. If c is larger than
budget/2 the program will throw an exception (because answering the query with
that degree of precision would break the budget).

3 Idealised Program 

In this section we describe the abstract model of the program and the API to
the PINQ operations. In the section thereafter we go on to model the PINQ
internals, which we call the protected system, before combining these
components into the overall model of Featherweight PINQ.

The first thing that we will abstract away from is the host programming
language. Here one could choose to model a simple programming language, but it
is not necessary to be that concrete. Instead we model a program as an
arbitrary deterministic system that maintains its own internal state and issues
commands to the PINQ internals. In this sense we idealise PINQ by assuming that
the API cannot be bypassed. In fact the PINQ system does not successfully
encapsulate all the protected parts of the system, and so some programs can
violate differential privacy by bypassing the encapsulation, or by using side
effects in places where side effects are not intended. By idealising the
interface we make clear the intended implementation, but not the details of its
realisation in any particular language. By treating programs abstractly we also
simplify other features of PINQ, including aspects of its architecture which
promote certain forms of extensibility.

Before describing the program model it is appropriate to say a few words
about the protected system (described formally in the next section). The
protected system contains all the datasets (tables) manipulated by the program.
Since these are the privacy sensitive data, we only permit the program to access
them via the API. The protected system tracks the stability of all the tables
which it maintains, together with a global budget. Our program interacts with
the protected system by the following operations:


Assignment Tables in the protected system are referred to via table variables.
A program can issue an assignment command. The model allows the program
to manipulate a table using transformations that assign a new value to table
variables.

The general form of assignment is $tv := F(tv_1, \ldots, tv_n)$, where
$F$ is taken from a set of function identifiers representing a family of
transformations with bounded stability (i.e. for each argument position $i$
there is a natural number $c_i$ such that if the size of the $i$th argument
changes by $n$ elements, then the result will change by at most $c_i \cdot n$
elements). This stability requirement comes from PINQ and is discussed in more
detail in the next section. Transformations include standard operations such as
the .Where(x => x.age > 17) from the example in Listing 1.1 and simple
assignments $tv_1 := tv_2$ (taking $F$ to be the identity function), as well as
assignments of literal tables (the case when $F$ has arity 0).

Query The only other operation of the PINQ API is the application of a
primitive differentially private query. In the example above we saw a compound
transformation and query operation parts[a].NoisyAverage(budget/2, x=>x.age).
It is sufficient to model just the query, since the transformation (x=>x.age)
can be implemented via an intermediate assignment. Thus we assume a set of
primitive queries Query, ranged over by $Q$, which take as argument a positive
real (the $\epsilon$ parameter) and a table, and produce a discrete probability
distribution over a domain of result values Val.

We generalise the single query operation to a parallel query, with syntax
$\mathit{query}(tv, f, Q, \epsilon)$, where

1. $tv$ is the table variable referring to the table that will be used for the
analysis,

2. $f$ is the partitioning function that maps each record to an index in
$\mathrm{codomain}(f) = \{1, \ldots, k\}$ for some $k \in \mathbb{N}$,

3. $Q$ is a vector of $k$ queries from Query.

The execution of this operation (as described in the next section) involves
computing the sequence of randomised values

\[ Q_i(\epsilon, \{r \in T \mid f(r) = i\}),\quad i \in \mathrm{codomain}(f) \]

where $T$ is the table bound to $tv$. This is the “parallel query” operation
described informally in the description of PINQ [9]. We use a single $\epsilon$
for all queries because if we chose an $\epsilon_i$ for each query the privacy
cost would be the maximum of all the epsilons in any case, so we may as well
enjoy the accuracy of the largest epsilon. However, we note that the
implementation of PINQ is more general than this, since the queries on each
partition may be performed in an adaptive way. Here we are making a trade-off
in keeping our model simple at the expense of not proving differential privacy
for quite as general a system.
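The parallel query can be sketched in a few lines of Python. This is a minimal
illustration, not PINQ's implementation: the names `parallel_query`,
`noisy_count` and `two_sided_geometric` are hypothetical, and the two-sided
geometric noise is merely one standard choice of a discrete
$\epsilon$-differentially private count, assumed here because the model
requires queries with discrete result distributions.

```python
import math
import random


def two_sided_geometric(epsilon, rng):
    """Integer noise with Pr[k] proportional to exp(-epsilon * |k|)."""
    def geometric():
        u = 1.0 - rng.random()                 # uniform in (0, 1]
        return int(math.floor(math.log(u) / -epsilon))
    # The difference of two i.i.d. geometric variables is two-sided geometric.
    return geometric() - geometric()


def noisy_count(epsilon, table, rng):
    """A primitive query: the size of the table plus discrete noise."""
    return len(table) + two_sided_geometric(epsilon, rng)


def parallel_query(table, f, queries, epsilon, rng):
    """Apply queries[i-1] to the partition {r in table | f(r) == i}."""
    k = len(queries)
    parts = [[r for r in table if f(r) == i] for i in range(1, k + 1)]
    return [q(epsilon, part, rng) for q, part in zip(queries, parts)]


rng = random.Random(0)
table = [{"gender": 0}, {"gender": 1}, {"gender": 0}]
by_gender = lambda r: r["gender"] + 1          # indices start at 1
print(parallel_query(table, by_gender, [noisy_count, noisy_count], 0.5, rng))
```

Every query receives the same `epsilon`, mirroring the single $\epsilon$ of the
model; the partitions are computed once, so the sketch is not adaptive.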

Client Program Model The above abstraction of the PINQ API allows us
to abstract away from all internal details of the programming language using
the API. Following [B] we model a program as an arbitrary labelled transition
system with labels representing the API calls:


Definition 1 (ProgAct Labels). The set of program action labels ProgAct,
ranged over by $a$ and $b$, is defined as the union of three syntactic forms:

1. the distinguished action $\tau$, representing computational progress without
interaction with the protected system,

2. $tv := F(tv_1, \ldots, tv_n)$ where $F$ is a function identifier, i.e. the
formal name of a transformation operation of arity $n$,

3. $\mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{v}$, where $f$ is a function
from records to $\{1, \ldots, k\}$ for some $k > 0$, where $\vec{v}$ is a vector
of values in Val, and $Q$ is a vector of $k$ queries.

Every label represents an interaction between a client program and the
protected system. The labels represent the observable output of a system, which
is a sequence of those actions: internal (silent) steps ($\tau$) modelling no
interaction, and vectors of values $\vec{v}$ which are the results of some query
being answered and returned to the program.

To define these transitions, we assume a client program modelled by a labelled
transition system modelling the API to the protected system. For client
programs, the label corresponding to a query call is of the form
$\mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{v}$, and models the pair of the
query and the returned result (as described before) as a single event. This
allows us to model value passing with no need to introduce any specific syntax
for programs. Note that the value returned by the query is known to the
program, and the program can act on it accordingly. From the perspective of the
program and the protected system together, this value will be considered an
observable output of the whole system.

Definition 2 (Client Program). A client program is a labelled transition
system $(\mathcal{P}, \to, P_0)$, with labels from ProgAct, where $\mathcal{P}$
is the set of all possible program states, $P_0$ is the initial state of the
program, and the transition relation
${\to} \subseteq \mathcal{P} \times \mathit{ProgAct} \times \mathcal{P}$ is
deadlock-free and satisfies the following determinacy property: for all states
$P$, if $P \xrightarrow{a} P'$ and $P \xrightarrow{b} P''$ then

1. if $a = b$ then $P' = P''$,

2. if $a$ is not a query then $a = b$,

3. if $a = \mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{v}$ then
$b = \mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{u}$ for some $\vec{u}$, and
for all actions $c$ of the form
$\mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{w}$ there exists a state $P_c$
such that $P \xrightarrow{c} P_c$.

The conditions on client programs are mild. Deadlock (i.e. termination)
freedom simplifies reasoning; a program that terminates in the conventional
sense can be modelled by adding a transition $P \xrightarrow{\tau} P$ for all
terminated states $P$. Query transitions model both the query sent and the
result received. Since we are modelling message passing using just transition
labels, the condition on queries states that the program must be able to accept
any result from a given query. Modulo the results returned by a query, the
conditions require the program to be deterministic. This is a technical
simplification which (we believe) does not restrict the power of the attacker.
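One way to picture a client program satisfying these conditions is as a pair
of functions: one giving the unique action offered in a state, and one giving
the successor state for any result. The Python sketch below is an assumed
encoding for illustration only; the state names, `Assign`, `Query`,
`next_action` and `next_state` are all hypothetical.

```python
from dataclasses import dataclass

TAU = "tau"  # the silent action


@dataclass(frozen=True)
class Assign:
    target: str
    func: str
    args: tuple


@dataclass(frozen=True)
class Query:
    table: str
    partition: str    # name of the partitioning function f
    queries: tuple    # names of the k primitive queries
    epsilon: float


def next_action(state):
    """The single action offered in a state (determinacy modulo results)."""
    if state == "start":
        return Assign("adults", "where_adult", ("input",))
    if state == "ask":
        return Query("adults", "by_gender", ("count", "count"), 0.5)
    return TAU        # finished states loop on tau, so there is no deadlock


def next_state(state, result=None):
    """Successor state; for a query it is defined for every possible result."""
    if state == "start":
        return "ask"
    if state == "ask":
        # None stands for the exceptional result of a denied query.
        return ("done", None if result is None else tuple(result))
    return state


print(next_action("start"))
print(next_state("ask", [3, 4]))
```

Because both functions are total and deterministic, the sketch accepts any
query result and never gets stuck, which is what the definition asks for.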



Remark: Implicit parameters We will prove that Featherweight PINQ provides
differential privacy for any client program. To avoid excessive
parametrisation of subsequent definitions, in what follows we will fix some
arbitrary client program $(\mathcal{P}, \to, P_0)$ and some arbitrary initial
budget $\epsilon$ and make definitions relative to these.

4 Featherweight PINQ 

In this section we turn to the model of the internals of PINQ, and the overall
semantics of the system. We begin by describing the components of the
protected system, and then give the overall model of Featherweight PINQ by
giving a probabilistic semantics (as a probabilistic labelled transition
system) to the combination of a client program and a protected system.

4.1 The Protected System 

Global Privacy Budget The first component of the protected system is the
global privacy budget. This is a non-negative real number representing the
remaining privacy budget. The idea is that if we begin with initial budget $b$
then Featherweight PINQ will enforce $b$-differential privacy. The global
budget is decremented as queries are computed, and queries are denied if they
would cause the budget to become negative. In PINQ the budget is associated
with a given data source. In our model we assume that there is only one data
source, and hence only one budget. Further, PINQ allows the budget to be
divided up and passed down to subcomputations. This does not fundamentally
change the expressiveness of PINQ since, as we show later, we are free to
extend Featherweight PINQ with the ability to query the global budget directly.
Thus any particular strategy for dividing the global budget between
subcomputations can be easily programmed.

The Table Environment The other data component of the protected system
is the table environment, which maps each table variable to the table it
denotes, together with a record of the scaling factor, which is a measure of
the stability of the table relative to the initial data set. We define this
precisely below. Formally, a table is a set of records, i.e. an element of the
power set $\mathcal{P}(\mathit{Record})$, and a protected table is a pair of a
table with its scaling factor:

\[ \mathit{ProtectedTable} \stackrel{\text{def}}{=} \mathit{Table} \times \mathbb{N} \]

4.2 The Featherweight PINQ Transition System 

Featherweight PINQ is defined by combining a client program with the protected
system to form the states of a probabilistic transition system.

Definition 3 (Featherweight PINQ States). The states (otherwise known
as configurations) of Featherweight PINQ, ranged over by $C$, $C'$ etc., are
triples of the form $(P, E, B)$ where $P$ is a client program state,
$E \in \mathit{TVar} \to \mathit{ProtectedTable}$ is the table environment, and
$B \in \mathbb{R}^{+}$ is the global budget.

There is a family of possible initial states, indexed by the distinguished
input table and the initial budget. We define these by assuming the existence
of a distinguished table variable, input, which we initialise with the input
table, while all other table variables are initialised with the empty table:

Definition 4 (Initial configuration).

\[
\mathit{Init}(T, B) = (P_0, E_T, B)
\quad\text{where}\quad
E_T(tv) =
\begin{cases}
(T, 1) & \text{if } tv = \mathit{input} \\
(\emptyset, 0) & \text{otherwise.}
\end{cases}
\]

The operational semantics of Featherweight PINQ can now be given:

Definition 5 (Semantics). The operational semantics of configurations is given
by a probabilistic labelled transition relation with transitions of the form
$C \xrightarrow{a}_{p} C'$ where
$a \in \mathit{Act} \stackrel{\text{def}}{=} \{\tau, \bot\} \cup \bigcup_{n \in \mathbb{N}} \mathit{Val}^n$,
and (probability) $p \in [0, 1]$. The definition is given by cases in Figure 1.
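\[
\frac{P \xrightarrow{\tau} P'}
     {(P, E, B) \xrightarrow{\tau}_{1} (P', E, B)}
\]

\[
\textsc{Assign}\quad
\frac{P \xrightarrow{tv := F(tv_1, \ldots, tv_n)} P'}
     {(P, E, B) \xrightarrow{\tau}_{1} (P', E[tv \mapsto (T', s)], B)}
\quad\text{where}\quad
\begin{cases}
E(tv_i) = (T_i, s_i),\ i \in \{1, \ldots, n\} \\
\mathit{stability}(F) = (c_1, \ldots, c_n) \\
s = \sum_{i=1}^{n} c_i \times s_i \\
T' = [\![F]\!](T_1, \ldots, T_n)
\end{cases}
\]

\[
\textsc{Query}_{\bot}\quad
\frac{P \xrightarrow{\mathit{query}(tv, f, Q, \epsilon)\,?\,\bot} P'}
     {(P, E, B) \xrightarrow{\bot}_{1} (P', E, B)}
\quad\text{where}\quad
\begin{cases}
E(tv) = (T, s) \\
\epsilon \cdot s > B
\end{cases}
\]

\[
\textsc{Query}\quad
\frac{P \xrightarrow{\mathit{query}(tv, f, Q, \epsilon)\,?\,\vec{v}} P'}
     {(P, E, B) \xrightarrow{\vec{v}}_{p} (P', E, B - s \cdot \epsilon)}
\quad\text{where}\quad
\begin{cases}
E(tv) = (T, s),\ \epsilon \cdot s \le B \\
\mathrm{codomain}(f) = \{1, \ldots, n\},\ \vec{v} \in \mathit{Val}^n \\
T_i = \{r \mid r \in T,\ f(r) = i\},\ i \in \{1, \ldots, n\} \\
p = \prod_{i=1}^{n} \Pr[Q_i(\epsilon, T_i) = v_i]
\end{cases}
\]

Fig. 1. Operational semantics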

We note at this point that some of the primitives have not yet been defined
(e.g. stability in the Assign rule), and that the rules of the system do not, a
priori, define a probabilistic transition system. We will elaborate these points
in what follows. We begin by explaining the rules in turn.

Assign When a program issues an assignment command
$tv := F(tv_1, \ldots, tv_n)$, the value of the stored table for $tv$ is updated
in the obvious way. We must also record the scaling factor of the table thus
computed. The scaling factor is computed from the scaling factors of the tables
for $tv_1, \ldots, tv_n$, and the stability of the transformation $F$. We assume
a mapping $[\![\cdot]\!]$ from formal function identifiers $F$ to the actual
table transformation functions $[\![F]\!]$ of corresponding arity.

Definition 6. A table transformation $f$ of arity $n$ has stability
$(c_1, \ldots, c_n)$ if for all $i \in \{1, \ldots, n\}$, we have

\[
|f(T_1, \ldots, T_i, \ldots, T_n) \ominus f(T_1, \ldots, T_i', \ldots, T_n)|
\le c_i \times |T_i \ominus T_i'|
\]

This is the $n$-ary generalisation of McSherry’s definition [9], and bounds the
size change in a result in terms of the size change of its argument. This is
made more explicit in the following:

Lemma 1. If $f$ has stability $(c_1, \ldots, c_n)$ then

\[
|f(T_1, \ldots, T_n) \ominus f(T_1', \ldots, T_n')|
\le \sum_{i=1}^{n} c_i \times |T_i \ominus T_i'|
\]

Note that not all functions have a finite stability. An example of this is the
database join operation (essentially the cartesian product); adding one new
element to one argument will add $k$ new elements to the result, where $k$ is
the size of the other argument. Thus there is no static bound on the number of
elements that may be added. For this reason PINQ (and hence Featherweight
PINQ) supports only transformation operations which have a finite stability.
Table 1 illustrates the stability of some of the transformations that are
introduced in PINQ. The variant of the join operation, Join*,
deterministically produces a bounded number of join elements. For the purpose
of this paper we do not need to be specific about the transformations. We
simply assume the existence of a function stability which soundly returns the
stability of a function identifier, i.e., if
$\mathit{stability}(F) = (c_1, \ldots, c_n)$ then $[\![F]\!]$ has stability
$(c_1, \ldots, c_n)$.
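The stability bound of Definition 6 can be checked on concrete tables. The
following Python sketch, with hypothetical helper names, tests the inequality
for a single pair of tables; it illustrates why a filter has stability 1 while
the cartesian product admits no fixed constant. It is a spot check on example
data, not a proof of stability.

```python
def sym_diff_size(a, b):
    """Size of the symmetric difference of two tables (sets of records)."""
    return len(set(a) ^ set(b))


def where(table, predicate):
    """Filter transformation, analogous to Where."""
    return {r for r in table if predicate(r)}


def cartesian(t1, t2):
    """Unrestricted join, modelled as the cartesian product."""
    return {(a, b) for a in t1 for b in t2}


def respects_stability(transform, c, t, t_prime):
    """Check |f(T) (-) f(T')| <= c * |T (-) T'| for one pair of tables."""
    lhs = sym_diff_size(transform(t), transform(t_prime))
    return lhs <= c * sym_diff_size(t, t_prime)


ages = {1, 2, 3, 20}
ages_plus_one = ages | {30}                    # one record added
adults = lambda table: where(table, lambda age: age > 17)
other = set(range(5))                          # second join argument, size 5
joined = lambda table: cartesian(table, other)

print(respects_stability(adults, 1, ages, ages_plus_one))
print(respects_stability(joined, 1, ages, ages_plus_one))
```

Adding one record changes the product by as many elements as the other
argument contains, so any candidate constant is defeated by a larger table.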

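The transition rule for assignment in Featherweight PINQ is thus

\[
\frac{P \xrightarrow{tv := F(tv_1, \ldots, tv_n)} P'}
     {(P, E, B) \xrightarrow{\tau}_{1} (P', E[tv \mapsto (T', s)], B)}
\quad\text{where}\quad
\begin{cases}
E(tv_i) = (T_i, s_i),\ i \in \{1, \ldots, n\} \\
\mathit{stability}(F) = (c_1, \ldots, c_n) \\
s = \sum_{i=1}^{n} c_i \times s_i \\
T' = [\![F]\!](T_1, \ldots, T_n)
\end{cases}
\]

The label $\tau$ on the rule says that nothing (other than computational
progress) is observable from the execution of this computation step. The
subscript 1 is the probability with which this step occurs.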


Transformation                                          Stability
Select(T, mapper)                                       (1)
Where(T, predicate)                                     (1)
GroupBy(T, keyselector)                                 (2)
Join*(T1, T2, n, m, keyselector1, keyselector2)         (n, m)
Intersect(T1, T2)                                       (1, 1)
Union(T1, T2)                                           (1, 1)
Partition(T, keyselector, keysList)                     (1)

Table 1. Transformation stability


Understanding the scaling factor Here we provide more intuition about
the scaling factor calculations, and explain some differences between the PINQ
implementation and the Featherweight PINQ model. As an example, suppose
we have a computation of a series of tables A-G depicted in Figure 2.

The figure represents a PINQ computation involving three unary
transformations (producing B, C, and D), one binary transformation producing G,
and one partition operation (splitting C into E and F). We have labelled the
transformation arcs with the stability constants of the respective
transformations.







Fig. 2. Transformations 
Table   Scaling factor   Calculation
A       1                Input table
B       2                s(A) × 2
C       3                s(A) × 3
D       10               s(B) × 5
E       3                s(C)
F       3                s(C)
G       22               s(D) × 1 + s(E) × 4

Fig. 3. Scaling factors (s)


What is the privacy cost of an $\epsilon$-differentially private query applied
to, say, table D? Since D is the result of two transformations on the input
data, the privacy cost is higher than just $\epsilon$. The product of the
sensitivities on the path from D backwards to the input A provides the scaling
factor for $\epsilon$. In this case the scaling factor for a query on D is 10.
The remaining scaling factors are summarised in the table in Figure 3.

The scaling factor is the stability of that specific table; it bounds the
maximum possible change of the table as a result of a change in the input
dataset, assuming that it was produced using the same sequence of operations.
The scaling factor is computed from the stabilities of the transformations that
produced it. The scaling factor for each protected table (except the input
table, which has scaling factor one) is computed compositionally using the
scaling factors ($s_i$) of all the arguments and the sensitivities of the
corresponding transformation arguments ($c_i$) using the following formula:

\[ s_A = \sum_{i \in \mathrm{parent}(A)} c_i \times s_i \]

Figures 2 and 3 allow us to explain two key differences between PINQ and
our model:

1. In PINQ, the tree structure depicted in the figure is represented explicitly,
and scaling factors are calculated lazily: at the point where a query with
accuracy $\epsilon$ is made on a table it is necessary to calculate its scaling
factor $s$ in order to determine the privacy cost $s \cdot \epsilon$. To do this
the tree is traversed from the query at the leaf back to the root, calculating
the scaling factor along the way. At the root the total privacy cost is then
known and deducted from the budget (providing the budget is sufficient). In
Featherweight PINQ the scaling factors of each table are computed eagerly, so
the tree structure is not traversed.

2. In Featherweight PINQ we restrict the partition operation to the leaves of
the tree, and combine it with the application of primitive queries to
partitions.

The consequence of these two simplifications is that we do not need to represent
the PINQ computation tree at all: all computations are made locally at the
point at which a table is produced or queried.
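The eager calculation can be replayed for the example tables A-G. The Python
sketch below assumes the stability constants that appear in the Calculation
column of Figure 3 (2, 3, 5, 1 and 4, with the partition of C treated as
stability 1); the function name `scaling_factor` is hypothetical.

```python
def scaling_factor(parents):
    """s = sum of c_i * s_i over (stability, parent scaling factor) pairs."""
    return sum(c * s for c, s in parents)


s = {"A": 1}                                          # the input table
s["B"] = scaling_factor([(2, s["A"])])
s["C"] = scaling_factor([(3, s["A"])])
s["D"] = scaling_factor([(5, s["B"])])
s["E"] = scaling_factor([(1, s["C"])])                # partition of C
s["F"] = scaling_factor([(1, s["C"])])                # partition of C
s["G"] = scaling_factor([(1, s["D"]), (4, s["E"])])   # binary transformation

epsilon = 0.01
print(s)
print("cost of an epsilon-query on D:", s["D"] * epsilon)
```

Each factor is computed once, when its table is produced, from factors that
are already stored, so no traversal back to the input is needed at query time.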

Queries Parallel queries were described in detail in the previous section. When
a program issues a query, it is represented as a parallel query and a possible
result, i.e. we model the query and the returned result as a single step. There
are two cases to consider, according to whether the budget is sufficient or not.
If the queried table $T$ has scaling factor $s$ then the cost of an $\epsilon$
query is $s \times \epsilon$. If this is greater than the current global budget
then the result is the exceptional value $\bot$. This value is the observable
result of the query, and it occurs with probability 1. On the other hand, if the
budget is sufficient, then the vector of query results $\vec{v}$ is returned
with probability $p = \prod_{i} \Pr[Q_i(\epsilon, T_i) = v_i]$ where $T_i$ is
the $i$th partition of $T$. Note that $p$ is indeed a probability, since the
component queries are independent.
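The two query cases can be sketched as a single Python function. This is an
illustrative encoding under stated assumptions: the result distributions of
the component queries are passed in as dictionaries (a hypothetical
representation of $\Pr[Q_i(\epsilon, T_i) = v]$), and `BOTTOM` stands for the
exceptional value.

```python
import math

BOTTOM = "bottom"  # stands for the exceptional value


def query_step(budget, scaling, epsilon, dists, results):
    """Return (observable label, probability, new budget) for one query step.

    dists[i] maps each value v to Pr[Q_i(epsilon, T_i) = v].
    """
    cost = scaling * epsilon
    if cost > budget:
        # Insufficient budget: the exception is observed with probability 1
        # and the budget is left unchanged.
        return BOTTOM, 1.0, budget
    # Sufficient budget: independent component queries multiply.
    p = math.prod(d.get(v, 0.0) for d, v in zip(dists, results))
    return tuple(results), p, budget - cost


dists = [{0: 0.5, 1: 0.5}, {0: 0.25, 1: 0.75}]
print(query_step(1.0, 2, 0.25, dists, [1, 0]))
print(query_step(0.4, 2, 0.25, dists, [1, 0]))
```

Note that the denied query changes neither the tables nor the budget, which is
why the exceptional case can be analysed like a silent step with a visible
label.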

5 Differential Privacy for Featherweight PINQ 

In this section we prove that Featherweight PINQ is differentially private.
We begin by recapping the goals of differential privacy, before showing how to
specialise the definition to Featherweight PINQ. Doing this entails building a
trace semantics for Featherweight PINQ.

Differential privacy guarantees that a data query mechanism (abstractly, a
randomized algorithm) behaves similarly on similar input databases. This
“similarity” is a quantitative measure $\epsilon$ on the difference in the
information obtained from any data set with or without any individual. When
this difference is small, the presence or absence of the individual in the data
set is difficult to ascertain.

Definition 7. Mechanism $f$ provides $\epsilon$-differential privacy if for any
two datasets $A$ and $B$ that differ in one record ($|A \ominus B| = 1$), and
for any set $S$ of possible outcomes of $f(A)$ and $f(B)$, the following
inequalities hold:

\[ e^{-\epsilon} \le \frac{\Pr[f(A) \in S]}{\Pr[f(B) \in S]} \le e^{\epsilon} \]

In this definition, $S$ is a subset of the range of outcomes for $f$
($S \subseteq \mathrm{Range}(f)$), and for similarity of outcomes we use the
ratio between the probabilities of observing outcomes
$\Pr[f(\mathit{dataset}) \in S]$ when the analyses are executed on any two
similar datasets $A$ and $B$. Finally, for similarity of datasets the Hamming
distance is used as a metric. In this work we assume that the primitive query
mechanisms (and thus Featherweight PINQ) provide answers over a discrete
probability distribution, so that it is sufficient to consider $S$ to be a
singleton set.
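For a discrete mechanism the singleton form of the definition can be checked
numerically. The Python sketch below uses a count perturbed by two-sided
geometric noise, a standard discrete $\epsilon$-differentially private
mechanism chosen here purely as an example; it is not claimed to be one of
PINQ's primitives, and the function names are hypothetical.

```python
import math


def geometric_pmf(epsilon, k):
    """Pr[noise = k] for two-sided geometric noise, alpha = exp(-epsilon)."""
    alpha = math.exp(-epsilon)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(k)


def noisy_count_pmf(epsilon, true_count, v):
    """Pr[noisy count = v] when the exact count is true_count."""
    return geometric_pmf(epsilon, v - true_count)


def max_log_ratio(epsilon, count_a, count_b, outcomes):
    """Largest |log(Pr[f(A) = v] / Pr[f(B) = v])| over the given outcomes."""
    return max(
        abs(math.log(noisy_count_pmf(epsilon, count_a, v)
                     / noisy_count_pmf(epsilon, count_b, v)))
        for v in outcomes
    )


epsilon = 0.5
# Neighbouring datasets: adding one record changes the exact count by one.
worst = max_log_ratio(epsilon, 10, 11, range(-20, 40))
print(worst <= epsilon + 1e-9)
```

The log-ratio of the two outcome probabilities never exceeds `epsilon` in
absolute value, which is exactly the pair of inequalities in the definition
restricted to singleton sets.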

5.1 Trace semantics 

The first step to instantiating the definition of differential privacy to
Featherweight PINQ is to be able to view Featherweight PINQ as defining a
probabilistic function. In fact each client program gives rise to a family of
probabilistic functions, one for each length of computation that is observed.
This is given by building a trace semantics on top of the transition system for
Featherweight PINQ.

The semantics of Featherweight PINQ is a probabilistic labelled transition
system of the simplest kind: for each configuration $C$, the sum of all
probabilities of all transitions of $C$ is equal to 1. The system is also
deterministic, in the sense that if $C \xrightarrow{a}_{p_1} C_1$ and
$C \xrightarrow{a}_{p_2} C_2$ then $p_1 = p_2$ and $C_1 = C_2$. This makes it
particularly easy to lift the probabilistic transition system from single
actions to traces of actions:



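Definition 8 (Trace semantics). Define the trace transitions
${\Rightarrow} \subseteq \mathit{Config} \times \mathit{Act}^{*} \times [0, 1] \times \mathit{Config}$
inductively as follows: (i) $C \stackrel{[]}{\Rightarrow}_{1} C$ where
$[] \in \mathit{Act}^{*}$ is the empty trace, and (ii) if
$C \xrightarrow{a}_{p} C'$ and $C' \stackrel{t}{\Rightarrow}_{q} C''$ then
$C \stackrel{at}{\Rightarrow}_{p \cdot q} C''$.

Traces inherit determinacy from the single transitions:

Proposition 1 (Traces are Deterministic). If
$C \stackrel{t}{\Rightarrow}_{p_1} C_1$ and
$C \stackrel{t}{\Rightarrow}_{p_2} C_2$ then $p_1 = p_2$ and $C_1 = C_2$.

This follows by an easy induction on the trace, using the fact that the single
step transitions are similarly deterministic.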

Lemma 2 (Traces are Probabilistic). Define

\[
M(C, t) =
\begin{cases}
p & \text{if } C \stackrel{t}{\Rightarrow}_{p} C' \text{ for some } C' \\
0 & \text{otherwise.}
\end{cases}
\]

For all configurations $C$, and all $n \ge 0$,

\[ \sum_{t \in \mathit{Act}^n} M(C, t) = 1 \]

where $\mathit{Act}^n$ is the set of traces of length $n$.

The proof is a simple induction on $n$, using the proposition above. The lemma
says that whenever $C \stackrel{t}{\Rightarrow}_{p} C'$, then $p$ is the
probability that you see trace $t$ after having observed $\mathit{size}(t)$
steps of the computation of $C$. We will thus refer to the probability of a
given trace to mean the probability of producing that trace from the given
configuration among all traces of the same length. We denote this by writing
$\Pr[C \stackrel{t}{\Rightarrow}] = p$ when
$C \stackrel{t}{\Rightarrow}_{p} C'$ for some $C'$.

Differential Privacy for Traces We are now in a position to specialise the
definition of differential privacy for Featherweight PINQ. How can we view
Featherweight PINQ as a probabilistic function? The probabilistic function is
determined by the client program (which we have kept implicit but unconstrained),
the initial budget $\varepsilon$, and the length of trace n that is observed. For any
combination of these we define the function which maps a table T to trace t of length n
with probability p precisely when $\Pr[\mathit{Init}(T, \varepsilon) \stackrel{t}{\Rightarrow}] = p$.

The instantiation of the differential privacy condition to Featherweight PINQ 
is thus: 


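$$\forall t, T, T', \varepsilon.\ \text{if } |T \ominus T'| = 1 \text{ then } e^{-\varepsilon} \leq \frac{\Pr[\mathit{Init}(T, \varepsilon) \stackrel{t}{\Rightarrow}]}{\Pr[\mathit{Init}(T', \varepsilon) \stackrel{t}{\Rightarrow}]} \leq e^{\varepsilon}$$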
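
To make the shape of this condition concrete, the following sketch checks the two-sided ratio bound for a single observation: one noisy count of the table. It is an illustration under stated assumptions, not PINQ's mechanism: the two-sided geometric mechanism is swapped in because its output probabilities can be written down exactly, and the names `geometric_pmf` and `satisfies_bound` are hypothetical.

```python
import math

def geometric_pmf(true_count, output, eps):
    """Pr[output] under the two-sided geometric mechanism centred on true_count."""
    alpha = math.exp(-eps)
    return (1 - alpha) / (1 + alpha) * alpha ** abs(output - true_count)

def satisfies_bound(table, neighbour, eps, outputs=range(-20, 21)):
    """Check e^-eps <= Pr[v | table] / Pr[v | neighbour] <= e^eps for each output v."""
    lo, hi = math.exp(-eps), math.exp(eps)
    slack = 1e-9  # tolerance for floating-point rounding
    for v in outputs:
        ratio = geometric_pmf(len(table), v, eps) / geometric_pmf(len(neighbour), v, eps)
        if not (lo * (1 - slack) <= ratio <= hi * (1 + slack)):
            return False
    return True
```

For tables that differ in one row the check succeeds; for tables that differ in two rows it fails for the same `eps`, which is the situation the scaling factors in the protected environment are designed to account for.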

Towards a proof of this property we introduce some notation to reflect key
invariants between the pairs of computations (for T and T' respectively).


Definition 9 (Similarity). We define similarity relations $\sim$ between tables,
environments, and configurations as follows:

- For tables T and T', and $s \in \mathbb{N}$, define $T \sim_s T'$ ("T is s-similar to T'") if
  and only if $|T \ominus T'| \leq s$.

- For protected environments E and E', define $E \sim E'$ if and only if for all
  tv, if $E(tv) = (T, s)$ and $E'(tv) = (T', s')$ then $s = s'$ and $T \sim_s T'$.

- For configurations, define $(P, E, B) \sim (P', E', B')$ if and only if $P = P'$,
  $E \sim E'$ and $B = B'$.
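
The three relations of Definition 9 can be written down directly. The sketch below is a minimal illustration which assumes that tables are multisets of hashable rows (represented as Python lists), that an environment is a dictionary from table variables to (table, scaling factor) pairs, and that a configuration is a triple; all function names are hypothetical.

```python
from collections import Counter

def sym_diff_size(t1, t2):
    """Size of the multiset symmetric difference of two tables."""
    c1, c2 = Counter(t1), Counter(t2)
    return sum(((c1 - c2) + (c2 - c1)).values())

def similar_tables(t1, t2, s):
    """T ~_s T': the tables differ in at most s rows."""
    return sym_diff_size(t1, t2) <= s

def similar_envs(e1, e2):
    """E ~ E': equal scaling factors, and tables similar up to that factor."""
    for tv in e1.keys() & e2.keys():
        (t1, s1), (t2, s2) = e1[tv], e2[tv]
        if s1 != s2 or not similar_tables(t1, t2, s1):
            return False
    return True

def similar_configs(c1, c2):
    """(P, E, B) ~ (P', E', B'): same program, similar environments, same budget."""
    (p1, e1, b1), (p2, e2, b2) = c1, c2
    return p1 == p2 and similar_envs(e1, e2) and b1 == b2
```

Note that only the table contents may differ between similar configurations: the program state, the scaling factors and the budget must coincide exactly.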

The configuration similarity relation captures the key invariant between the 
two computations in our proof of differential privacy. First we need to show that 
the invariant is established for the initial configurations: 

Lemma 3. If $T \sim_1 T'$ then $\mathit{Init}(T, B) \sim \mathit{Init}(T', B)$.

This follows easily from the definition of the initial configuration. Now the
main theorem shows that this is maintained throughout the computation:

Theorem 1. If $T \sim_1 T'$ and $\mathit{Init}(T, B) \stackrel{t}{\Rightarrow}_{p} C$, then $\mathit{Init}(T', B) \stackrel{t}{\Rightarrow}_{q} C'$ where
$C \sim C'$ and $p \leq q \cdot \exp(B - \varepsilon)$ for some $\varepsilon \leq B$.

The proof, an induction over the length of the trace, is given in Appendix A.

Corollary 1 (B-differential privacy). If $T \sim_1 T'$ and $\Pr[\mathit{Init}(T, B) \stackrel{t}{\Rightarrow}] = p$
then $\Pr[\mathit{Init}(T', B) \stackrel{t}{\Rightarrow}] = q$ for some q such that $p \leq q \cdot \exp(B)$.

6 Related Work 

The approach described in this paper owes much to the model used in the
formalisation developed in our recent work on personalised differential privacy
[6]. The idea to model the client program as an abstract labelled transition
system comes from that work. That work also shows how dynamic inputs can
be handled without major difficulties.

The closest other prior work is developed by Tschantz et al. [12]. Their work
introduces a way to model interactive query mechanisms as probabilistic
automata, and develops bisimulation-based proof techniques for reasoning about the
differential privacy of such systems. As a running example they consider a
system "similar to PINQ", and use it to demonstrate their proof techniques. From
our perspective their system is significantly different from PINQ in a number of
ways: (i) it does not model the transformation of data at all, but only queries on
unmodified input data, (ii) it models a system with a bounded amount of
memory, and implements a mechanism which deletes data after it has been used for a
fixed number of queries (neither of which relates to the implementation of PINQ).
Regarding the proof techniques developed in [12], as previously noted in [6], a
key difference between our formalisation and theirs is that they model a passive
system which responds to external queries from the environment. In contrast,
our model includes the adaptive adversary (the client program) as an explicit
part of the configuration. In information-flow security (to which differential
privacy is related) this difference in attacker models can be significant [U]. However,
it may be possible to prove that the passive model of [12] is sound for the active
model described here (cf. a similar result for interactive noninterference El).

Haeberlen et al. [8] point out a number of flaws and covert channels in the PINQ
system. This may seem at odds with our claims for the soundness of PINQ, but
in fact all the flaws described are either covert timing channels (which we do
not attempt to model), flaws in PINQ's implementation of encapsulation,
failure to prevent unwanted side-effects, or combinations of these. Following this
analysis, Haeberlen et al. introduce a completely different approach to
programming with differential privacy (an approach further developed and refined in ffH
[7]) based on statically tracking sensitivity through sensitivity-types. This
non-interactive approach is rigorously formalised and proven to provide differential
privacy.

Barthe et al. [1] introduce a relational Hoare logic for reasoning formally
about the differential privacy of algorithms. They include theorems relating to
sequential and parallel composition of queries in the style of those stated by
McSherry [9]. Unlike the present work, [1] does not rely on differentially private
primitives, but is able to prove differential privacy from first principles.

7 Limitation and Extension 

In this section we discuss what we see as the main limitations of Feather- 
weight PINQ in relation to the PINQ system. We also discuss some easy exten- 
sions that become apparent from the proof of differential privacy. 

Parallel Queries The form of parallel query that we model matches the
informal description in [9], but is not as general as the construct found in the
implementation. We believe that this is the main shortcoming of the Featherweight
PINQ model, as the more general form is interesting, and its correctness is not
immediately obvious. (Whether the shortcoming has any practical significance in
the way one might write programs is less clear.) The difference was described in
Section 4 in connection with Figure 2, which depicts a partition operation which
is not supported by Featherweight PINQ since it is not immediately followed by
queries on the partitions. In fact the queries in PINQ need not be parallel at
all, but can be adaptive (i.e., a query on one partition can be used to influence
the choice of query on other partitions). This is not easily supported by
a small change to our model, since it does not seem to be implementable using
Featherweight PINQ's simple history-free use of explicit scaling factors. A proof
of differential privacy for a more general protected system model encompassing
this is left for future work.

Extensions to the PINQ API We mention one extension to PINQ that
emerges from the details of the correctness proof. In PINQ, the budget and
the actual privacy cost of executing an $\varepsilon$-differentially private query on some
intermediate table are not directly visible to the program:

    "An analyst using PINQ is uncertain whether any request will be accepted
    or rejected, and must simply hope that the underlying PINQAgents accept
    all of their access requests." [9] (§3.6)

Recall that the key invariant that relates the two runs of the system on
neighbouring data sets (Definition 9) states that the budgets and the scaling factors
in the respective environments are equal. This means that they contain no
information about the sensitive data. This, in turn, means that we can freely permit
the program to query them. This would allow the analyst to calculate the cost
of queries and to make accuracy decisions relative to the current privacy budget.

Here we briefly outline this extension. We add two new actions to the set of
program actions ProgAct, namely a query on the sensitivity of a table variable
of the form $tv\,?\,s$, where $s \in \mathbb{N}$, and a query on the global budget, $\mathit{budget}\,?\,v$,
where $v \in \mathbb{R}^{\geq 0}$. The transition rules are given in Figure 4.


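Query sensitivity:

$$\frac{P \xrightarrow{tv\,?\,s} P'}{(P, E, B) \xrightarrow{tv\,?\,s}_{1} (P', E, B)} \quad \text{where } E(tv) = (T, s)$$

Query budget:

$$\frac{P \xrightarrow{\mathit{budget}\,?\,B} P'}{(P, E, B) \xrightarrow{\mathit{budget}\,?\,B}_{1} (P', E, B)}$$

Fig. 4. Budget- and Scaling-Factor-Query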
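
The reason these two queries are harmless can be seen in a small sketch. The function `answer`, the tuple encoding of actions and the variable name `input` are hypothetical choices made for this illustration; the only assumption taken from the model is that an environment maps each table variable to a table together with its scaling factor.

```python
def answer(action, env, budget):
    """Reply to a hypothetical budget or scaling-factor query.

    env maps table variables to (table, scaling_factor). Neither reply
    inspects the table contents, and the configuration is left unchanged.
    """
    if action[0] == "sensitivity":      # models the action  tv ? s
        _table, s = env[action[1]]
        return s
    if action[0] == "budget":           # models the action  budget ? v
        return budget
    raise ValueError(f"unknown action: {action!r}")

# Two runs on neighbouring tables: same scaling factors, same budget.
env_a = {"input": (["alice", "bob"], 1)}
env_b = {"input": (["alice"], 1)}
```

Both runs return identical replies, since the replies depend only on components that the similarity relation requires to be equal.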


8 Conclusion 

We started by presenting some gaps between the theory of differential
privacy and the implementation of the PINQ framework. To verify the privacy
assurance of analyses written in the PINQ framework and to address these
concerns, we introduced an idealised model of the implementation of PINQ. In
the model, only PINQ's internal implementation has direct access to the sensitive
data. An analysis written in this framework has indirect access to the protected
system through a limited, well-defined set of interface APIs. In addition
to the standard PINQ APIs, we extended the model with our own proposed
APIs for retrieving the scaling factor and the budget from the protected
environment. Furthermore, we instantiated the definition of differential privacy
to prove that any analysis constructed in this setting, together with its
communication with the protected system, does not violate the privacy guarantee
promised by PINQ.

We believe that our model (and our general approach to modelling such
systems) could be of benefit in formalising emerging variants of the PINQ framework,
such as wPINQ [10], or Streaming PINQ [15].
A Proof of Theorem 1

Proof. Assume $\mathit{Init}(T, B) \stackrel{t}{\Rightarrow}_{p} C$. We proceed by induction on the length of the
trace t, and by cases according to the last step of the trace.

Base case: $t = []$. In this case $p = q = 1$ and $C = \mathit{Init}(T, B)$ and $C' =
\mathit{Init}(T', B)$. So $\varepsilon = \varepsilon' = 0$, and $C \sim C'$ by Lemma 3.

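Inductive step: $t = t_1 a$. Suppose that $\mathit{Init}(T, B) \stackrel{t_1}{\Rightarrow}_{p_1} (P_1, E_1, B_1) \xrightarrow{a}_{p_2}
(P, E, B) = C$, and hence that $p = p_1 p_2$.

The induction hypothesis gives us $q_1$, $P_1$, $E_1'$ and $\varepsilon_1$ such that

$$\mathit{Init}(T', B) \stackrel{t_1}{\Rightarrow}_{q_1} (P_1, E_1', B_1) \tag{1}$$

$$E_1 \sim E_1' \tag{2}$$

$$p_1 \leq q_1 \cdot \exp(B - \varepsilon_1) \tag{3}$$

We proceed by cases on the rule applied as the last transition
($(P_1, E_1, B_1) \xrightarrow{a}_{p_2} (P, E, B)$). In every case except query execution we have
$p_2 = 1$ and $(P_1, E_1', B_1) \xrightarrow{a}_{1} C'$ for some $C'$. In those cases it follows that
$p \leq q \cdot \exp(B - \varepsilon)$ by taking $\varepsilon = \varepsilon_1$ and using (3).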

Case 1: Silent. In this case $a = \tau$ and $P_1 \xrightarrow{\tau} P$.

$$(P_1, E_1, B_1) \xrightarrow{\tau}_{1} C = (P, E_1, B_1)$$

$$(P_1, E_1', B_1) \xrightarrow{\tau}_{1} C' = (P, E_1', B_1)$$

It follows directly from (2) that $C \sim C'$.


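Case 2: Assign. Here $P_1 \xrightarrow{a} P$, where $a$ assigns the result of a transformation $F$
on $tv_1, \ldots, tv_n$ to $tv$, and so we have

$$C = (P, E_1[tv \mapsto (T, s)], B_1)$$

$$C' = (P, E_1'[tv \mapsto (T', s)], B_1)$$

where for $i \in \{1, \ldots, n\}$

$$E_1(tv_i) = (T_i, s_i)$$

$$\mathit{stability}(F) = (c_1, \ldots, c_n)$$

$$s = \sum_{i=1}^{n} c_i \times s_i$$

$$T = [\![F]\!](T_1, \ldots, T_n)$$

$$T' = [\![F]\!](T_1', \ldots, T_n')$$

From (2) we have $E_1(tv_i) \sim E_1'(tv_i)$, which means $s_i = s_i'$ and $T_i \sim_{s_i} T_i'$. Using the
similarity definition and Lemma 1 we have $T \sim_s T'$ and hence $C \sim C'$.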
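
The scaling factor recorded for the assigned table is a weighted sum of the scaling factors of the argument tables. A minimal sketch of that bookkeeping follows; the function name `assigned_scaling_factor` and the example numbers are hypothetical.

```python
def assigned_scaling_factor(stabilities, scaling_factors):
    """s = sum_i c_i * s_i for a transformation applied to n argument tables.

    stabilities     -- (c_1, ..., c_n), the stability of the transformation
    scaling_factors -- (s_1, ..., s_n), taken from the protected environment
    """
    if len(stabilities) != len(scaling_factors):
        raise ValueError("one stability per argument table is required")
    return sum(c * s for c, s in zip(stabilities, scaling_factors))
```

For instance, a binary transformation with stability (1, 1) applied to tables with scaling factors 1 and 2 yields a table with scaling factor 3. Since the computation uses only the stabilities and the scaling factors, which coincide in the two runs, both runs record the same value of s.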

Case 3: Query. The result of query execution depends on the remaining
budget and the sensitivity of the table that the query is executed on. If the privacy
budget is insufficient, an exception is thrown to inform the program about the
shortage of budget; otherwise each query in the list of queries is executed
on its corresponding partition and the results are returned as a list of
values, $\vec{v}$.

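Case 3.1: Query (out of budget). Here we have a rule instance of the
form:

$$\frac{P \xrightarrow{a} P'}{(P, E, B) \xrightarrow{a}_{1} (P', E, B)} \quad \text{where } E(tv) = (T, s) \text{ and } \varepsilon \cdot s > B$$

in which $a$ is the $\mathit{query}(tv, f, \vec{Q}, \varepsilon)$ action answered with the out-of-budget
exception. In this case $C \sim C'$, and the argument is similar to the silent case.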

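Case 3.2: Query. Similarly we have a rule instance of the form:

$$\frac{P \xrightarrow{\mathit{query}(tv, f, \vec{Q}, \varepsilon)\,?\,\vec{v}} P'}{(P, E, B) \xrightarrow{\mathit{query}(tv, f, \vec{Q}, \varepsilon)\,?\,\vec{v}}_{p} (P', E, B - s \cdot \varepsilon)}$$

where

$$\begin{aligned}
& E(tv) = (T, s), \quad \varepsilon \cdot s \leq B \\
& \mathit{codomain}(f) = \{1, \ldots, n\}, \quad \vec{v} \in \mathit{Val}^n \\
& T_i = \{ r \mid r \in T, f(r) = i \}, \quad i \in \{1, \ldots, n\} \\
& p = \prod_{i=1}^{n} \Pr[Q_i(\varepsilon, T_i) = v_i]
\end{aligned}$$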


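Hence we have a transition $(P_1, E_1, B_1) \xrightarrow{a}_{p_2} C = (P, E, B)$ and the analogous
transition $(P_1, E_1', B_1) \xrightarrow{a}_{q_2} C' = (P, E', B)$. Writing $\varepsilon_2$ for the accuracy
parameter of the query, $\varepsilon = \varepsilon_1 + (s \cdot \varepsilon_2)$ is the value needed
for Theorem 1.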
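
The bookkeeping performed by the two query rules can be sketched as follows. This is an illustration under simplifying assumptions, not PINQ's implementation: every query in the list is taken to be the same Laplace-noised count, the exception class `OutOfBudget` stands in for whatever the protected system returns to the program, and the names `noisy_count` and `run_query` are hypothetical.

```python
import random

class OutOfBudget(Exception):
    """Hypothetical stand-in for the exception returned to the program."""

def noisy_count(eps, rows, rng):
    """Count with Laplace(1/eps) noise, sampled as a difference of exponentials."""
    return len(rows) + rng.expovariate(eps) - rng.expovariate(eps)

def run_query(env, budget, tv, f, n, eps, rng):
    """Return (values, new_budget), or raise OutOfBudget leaving the state unchanged."""
    table, s = env[tv]
    if eps * s > budget:                     # side condition of the out-of-budget rule
        raise OutOfBudget
    # Partition the table by f, whose codomain is assumed to be {1, ..., n}.
    parts = [[r for r in table if f(r) == i] for i in range(1, n + 1)]
    values = [noisy_count(eps, part, rng) for part in parts]
    return values, budget - s * eps          # the budget is charged s * eps once
```

The decision to accept or reject the query depends only on the scaling factor, the accuracy parameter and the budget, never on the rows of the table, which is why both runs take the same branch. The cost `s * eps` is charged once for all partitions rather than once per partition.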

For parallel queries on disjoint sets we have the following equation:

$$\Pr[P_1 \xrightarrow{a} P] = \prod_{i=1}^{n} \Pr[Q_i(s \cdot \varepsilon_2, T_i) = v_i]$$

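Here we need to show that the following inequality is valid:

$$\prod_{i=1}^{n} \Pr[Q_i(s \cdot \varepsilon_2, T_i) = v_i] \leq \prod_{i=1}^{n} \Pr[Q_i(s \cdot \varepsilon_2, T_i') = v_i] \times \prod_{i=1}^{n} \exp(\varepsilon_2 \times |T_i - T_i'|)$$

From $\sum_{i=1}^{n} |T_i - T_i'| \leq s$, we have $\prod_{i=1}^{n} \exp(\varepsilon_2 \times |T_i - T_i'|) \leq \exp(\varepsilon_2 \times s)$,
from which we conclude:

$$\prod_{i=1}^{n} \Pr[Q_i(s \cdot \varepsilon_2, T_i) = v_i] \leq \prod_{i=1}^{n} \Pr[Q_i(s \cdot \varepsilon_2, T_i') = v_i] \times \exp(\varepsilon_2 \cdot s)$$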

These parallel queries provide $(s \cdot \varepsilon_2)$-differential privacy, which means:

$$p_2 \leq q_2 \cdot \exp(\varepsilon_2 \cdot s)$$
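
The step that bounds the sum of the per-partition distances relies on the partitions being disjoint: each row lands in exactly one partition, so a row in which the two tables differ is counted once. The sketch below checks this on a small example, assuming tables are multisets of rows and distance is the size of the symmetric difference; the function names are hypothetical.

```python
from collections import Counter

def sym_diff_size(t1, t2):
    """Size of the multiset symmetric difference of two tables."""
    c1, c2 = Counter(t1), Counter(t2)
    return sum(((c1 - c2) + (c2 - c1)).values())

def partition(table, f, n):
    """Split a table into n parts according to f, with codomain {1, ..., n}."""
    return [[r for r in table if f(r) == i] for i in range(1, n + 1)]

def distances(t1, t2, f, n):
    """Per-partition distances and the distance between the whole tables."""
    per_part = [
        sym_diff_size(a, b)
        for a, b in zip(partition(t1, f, n), partition(t2, f, n))
    ]
    return per_part, sym_diff_size(t1, t2)
```

With `t1 = [1, 2, 3, 4, 6]`, `t2 = [2, 3, 5, 6, 6]` and `f = lambda r: 1 + r % 3`, the per-partition distances add up to the distance between the whole tables, so a bound of s on the latter carries over to the sum.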

Multiplying the two sides of the previous inequality with (3) we get:

$$p_1 \cdot p_2 \leq q_1 \cdot q_2 \cdot \exp(\varepsilon_1) \cdot \exp(\varepsilon_2 \cdot s)$$

Knowing = B — eq result in choosing e to be £ = B — eq — (e 2 • s). Finally 
it is easy to see that $C \sim C'$, as the reduction of the global budget is the only
change in the configuration. □


